Nonlinear Policy Gradient Algorithms for Noise-Action MDPs

Authors

  • Krishnamurthy Dvijotham
  • Emanuel Todorov
Abstract

We develop a general theory of efficient policy gradient algorithms for Noise-Action MDPs (NMDPs), a class of MDPs that generalizes Linearly Solvable MDPs (LMDPs). For finite horizon problems, these algorithms lead to simple update equations based on multiple rollouts of the system. We show that our policy gradient algorithms are faster than the PI algorithm, a state-of-the-art policy optimization algorithm. We provide an alternate interpretation of the PI algorithm that further justifies this: the PI algorithm is actually performing gradient descent with respect to a risk-seeking objective rather than the desired expected-cost objective. For infinite horizon problems, we develop algorithms that only require estimation of a state value function, rather than a state-action Q function or advantage function. We develop policy gradient algorithms for all MDP formulations (finite horizon, infinite horizon, first exit) and arbitrary policy parameterizations. We demonstrate the effectiveness of the policy gradient algorithms on simple 2-D nonlinear dynamical systems and large linear dynamical systems.
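
To make the rollout-based update concrete, the following is a minimal sketch of a generic score-function (REINFORCE-style) gradient estimate over a finite horizon. The rollout_fn helper and its return format are assumptions for illustration only; this is the standard estimator, not the NMDP-specific update equations derived in the paper.

    import numpy as np

    def policy_gradient_estimate(rollout_fn, theta, n_rollouts=50, horizon=20):
        """Monte-Carlo policy gradient from multiple rollouts (REINFORCE-style).

        rollout_fn(theta, horizon) is a hypothetical helper assumed to return one
        trajectory as a list of (grad_log_pi, cost) pairs under the current
        policy parameters theta.
        """
        grad = np.zeros_like(theta)
        for _ in range(n_rollouts):
            trajectory = rollout_fn(theta, horizon)
            total_cost = sum(cost for _, cost in trajectory)
            # The gradient of the expected cost is estimated by
            # total_cost times the summed score vectors of the trajectory.
            score = sum(g for g, _ in trajectory)
            grad += total_cost * score
        return grad / n_rollouts

    # One gradient-descent step on the expected-cost objective:
    # theta = theta - learning_rate * policy_gradient_estimate(rollout_fn, theta)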

Similar Articles

Policy gradients in linearly-solvable MDPs

We present policy gradient results within the framework of linearly-solvable MDPs. For the first time, compatible function approximators and natural policy gradients are obtained by estimating the cost-to-go function, rather than the (much larger) state-action advantage function as is necessary in traditional MDPs. We also develop the first compatible function approximators and natural policy g...
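
As a rough illustration of the natural-gradient step such results build on, here is a minimal sketch that preconditions a sampled policy gradient with an empirical Fisher matrix. The array shapes and parameter names are illustrative assumptions; the advantage estimates stand in for the cost-to-go-based quantities described above, not the paper's compatible approximators.

    import numpy as np

    def natural_gradient_step(grad_log_pis, advantages, lr=0.1, reg=1e-6):
        """One natural policy gradient step from sampled data.

        grad_log_pis: (N, d) array of grad_theta log pi(a_i | s_i)
        advantages:   (N,)  array of advantage estimates, here assumed to be
                      derived from a cost-to-go estimate.
        Implements the generic update theta <- theta - lr * F^{-1} g,
        with F the empirical Fisher information matrix.
        """
        N = len(advantages)
        g = grad_log_pis.T @ advantages / N       # vanilla policy gradient
        F = grad_log_pis.T @ grad_log_pis / N     # empirical Fisher information
        return -lr * np.linalg.solve(F + reg * np.eye(F.shape[0]), g)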

Risk-Constrained Reinforcement Learning with Percentile Risk Criteria

In many sequential decision-making problems one is interested in minimizing an expected cumulative cost while taking into account risk, i.e., increased awareness of events of small probability and high consequences. Accordingly, the objective of this paper is to present efficient reinforcement learning algorithms for risk-constrained Markov decision processes (MDPs), where risk is represented v...
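
A common percentile-style risk measure in this setting is CVaR. The sketch below, included only as an assumed illustration, estimates it empirically from sampled cumulative costs given a confidence level alpha.

    import numpy as np

    def empirical_cvar(costs, alpha=0.95):
        """Empirical CVaR_alpha of sampled cumulative costs.

        CVaR_alpha is the mean cost in the worst (1 - alpha) tail of the cost
        distribution, the kind of percentile risk criterion used as a
        constraint in risk-constrained MDP formulations.
        """
        costs = np.asarray(costs, dtype=float)
        var = np.quantile(costs, alpha)      # Value-at-Risk: the alpha-quantile
        tail = costs[costs >= var]           # worst-case tail of the samples
        return tail.mean() if tail.size else var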

Approximate Newton Methods for Policy Search in Markov Decision Processes

Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analy...
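
A minimal sketch of the kind of preconditioned update an approximate Newton method performs, assuming an estimated gradient and a positive semi-definite curvature approximation are already available; it is illustrative only, not the paper's exact construction.

    import numpy as np

    def approx_newton_step(grad, curvature, lr=1.0, reg=1e-5):
        """Approximate Newton step for policy parameters.

        grad:      estimated policy gradient, shape (d,)
        curvature: positive semi-definite Hessian approximation, shape (d, d),
                   e.g. assembled from outer products of sampled score vectors
                   so that the exact (expensive) Hessian is never formed.
        """
        H = curvature + reg * np.eye(len(grad))  # regularize to keep the solve well-posed
        return -lr * np.linalg.solve(H, grad)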

Uniform Convergence of Value Iteration Policies for Discounted Markov Decision Processes

This paper deals with infinite horizon Markov Decision Processes (MDPs) on Borel spaces. The objective function considered, induced by a nonnegative and (possibly) unbounded cost, is the expected total discounted cost. For each of the MDPs analyzed, the existence of a unique optimal policy is assumed. Conditions that guarantee both pointwise and uniform convergence on compact sets of the minimiz...
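
For reference, a finite state/action sketch of the value iteration scheme whose convergence is studied (the paper itself works on general Borel spaces with possibly unbounded costs); the tensor shapes and helper name are assumptions for illustration.

    import numpy as np

    def value_iteration(P, c, gamma=0.95, tol=1e-8, max_iters=10_000):
        """Value iteration for a finite discounted-cost MDP.

        P: transitions, shape (A, S, S), with P[a, s, s'] = Pr(s' | s, a)
        c: nonnegative costs, shape (S, A)
        Returns the value function and a greedy policy.
        """
        S, A = c.shape
        V = np.zeros(S)
        for _ in range(max_iters):
            # Bellman backup: Q[s, a] = c[s, a] + gamma * sum_{s'} P[a, s, s'] V[s']
            Q = c + gamma * np.einsum('asp,p->sa', P, V)
            V_new = Q.min(axis=1)                # minimize cost over actions
            if np.max(np.abs(V_new - V)) < tol:
                V = V_new
                break
            V = V_new
        Q = c + gamma * np.einsum('asp,p->sa', P, V)
        return V, Q.argmin(axis=1)               # value function, greedy policy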

A Gauss-Newton Method for Markov Decision Processes

Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton’s method, such as a fast rate of convergence, whilst alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first ana...

Publication date: 2012